feat(iceberg): Add bucket function #13174

jinchengchenghh · 2025-04-28T12:47:58Z

The implementation aligns with https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java

The transform is described in the Iceberg specification: https://iceberg.apache.org/spec/#partition-transforms

Resolves: #13980

netlify · 2025-04-28T12:48:19Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`8d59d01`
🔍 Latest deploy log	https://app.netlify.com/projects/meta-velox/deploys/68b56368be3a6b000858776f

velox/functions/iceberg/util/CMakeLists.txt

jinchengchenghh · 2025-04-29T14:30:33Z

Can you help review? Thanks! @rui-mo @zhli1142015

rui-mo

Thanks! Added two initial comments.

velox/functions/iceberg/BucketFunction.cpp

velox/docs/ext/iceberg.py

velox/functions/iceberg/util/CMakeLists.txt

CMakeLists.txt

mbasmanova · 2025-07-01T10:58:59Z

@jinchengchenghh Is there a GitHub issue that provides overall context for this work? If not, would it be possible to create one and explain what we are trying to achieve?

jinchengchenghh · 2025-07-02T01:47:48Z

I created issue #13980 just recently, @mbasmanova.

velox/docs/functions/iceberg/bucket.rst

velox/functions/iceberg/util/Murmur3_32HashFunction.h

velox/functions/iceberg/BucketFunction.cpp

velox/functions/iceberg/util/Murmur3Hash32.h

velox/functions/iceberg/util/tests/Murmur3_32HashFunctionTest.cpp

jinchengchenghh · 2025-07-03T10:36:07Z

Could you help review this PR? Thanks! @rui-mo

mbasmanova

@jinchengchenghh Let's split this PR into one that adds the hash functions and one that adds BucketFunction.

velox/docs/functions/iceberg/functions.rst

velox/functions/iceberg/BucketFunction.cpp

mbasmanova · 2025-07-03T10:58:32Z

velox/functions/iceberg/BucketFunction.cpp

+  template <typename TInput>
+  FOLLY_ALWAYS_INLINE Status
+  call(int32_t& out, const int32_t& numBuckets, const TInput& input) {
+    VELOX_RETURN_IF(


Should we add some more macros to reduce boilerplate?

We have a throwing version:

VELOX_USER_CHECK_GT(numBuckets, 0, "Invalid number of buckets")

Perhaps we could add a non-throwing version:

VELOX_USER_RETURN_GT(numBuckets, 0, "Invalid number of buckets")

CC: @majetideepak @pedroerp @rui-mo

velox/functions/iceberg/BucketFunction.cpp

velox/functions/iceberg/util/Murmur3Hash32.h

velox/functions/lib/Hash.h

velox/functions/iceberg/util/Murmur3Hash32.h

rui-mo

Thanks.

velox/functions/iceberg/BucketFunction.cpp

velox/docs/functions/iceberg/functions.rst

velox/functions/iceberg/BucketFunction.cpp

velox/functions/iceberg/tests/BucketFunctionTest.cpp

yingsu00

@mbasmanova @jinchengchenghh
Murmur3 is the hash function used by Iceberg's partition transform bucket(c,n). Its role is analogous to the hashXXX() functions in HivePartitionFunction. In general, a PartitionFunction for a specific connector is used in the following scenarios:

1. Data exchange in queries (e.g., joins or aggregations):
- PartitionFunction::partition() determines which reducer/task a row should go to, based on the hash of the partition keys. The hash algorithm differs from connector to connector. E.g. Iceberg uses Murmur3, while Hive uses Java's hashCode() function, which in most cases is just a simple bit shift and XOR.
- The bucket function ensures that bucketed joins align buckets from both tables. The Hive bucket function is usually just the hash value modulo the number of buckets: bucket = hash(key) % num_buckets. This could result in a negative value, while Iceberg uses a consistent bucket function that guarantees a non-negative value.

2. File writing:
- When inserting into a partitioned table, Hive computes the partition number using HivePartitionFunction::partition(), based on the algorithm mentioned above.
- It maps a row to a bucket using the connector-specific bucket function, based on the bucket calculation method mentioned above.

3. Connector-specific functions:
In Iceberg, users can query the partition functions directly or use them in filters. E.g. SELECT * FROM t WHERE bucket(c,4) = 3, or SELECT day('2002-01-01'). Note that these functions are Iceberg-specific and have their own semantics. For example, Iceberg's DAY(dt) is the number of days since 1970-01-01, while Presto's default DAY(dt) is just the day of that particular month.

This PR only touches the 3rd case. However, taking a more holistic view, we will need to add IcebergPartitionFunction shortly. If we want to keep it aligned with Hive, we may want to put the murmur3 hash function and bucket function in IcebergPartitionFunction (HivePartitionFunction keeps the hash algorithm implementations for Hive). Then when we are registering the iceberg bucket(c,n) function, it can call the hash algorithm implementations from IcebergPartitionFunction. As an alternative, if we want to put all hash algorithms in a central place, we may need to refactor HivePartitionFunction and move all of its hash implementations there too. That way we can keep the Hive and Iceberg implementations consistent and make the repo more manageable and readable.

I also wonder how we should set up the framework for connector specific functions. E.g. consider catalog validation (ensures these functions run on the correct catalogs, i.e. a user should not be able to run an Iceberg function on Hive tables). Also shall it be in a different "iceberg" namespace? We'd love to hear more suggestions.
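The bucket calculation discussed in this comment can be sketched as follows. Per the Iceberg specification linked in the PR description, the bucket transform is (murmur3_x86_32_hash(v) & Integer.MAX_VALUE) % N, and int and long values are hashed as 8-byte little-endian longs with seed 0. The names murmur3HashLong, icebergBucket, and plainModulo are hypothetical and are not the identifiers used in this PR; this is a minimal illustration, not the Velox implementation.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit value left by r bits (0 < r < 32).
inline uint32_t rotl32(uint32_t x, int r) {
  return (x << r) | (x >> (32 - r));
}

// Murmur3 x86_32 per-block key mixing.
inline uint32_t mixK1(uint32_t k1) {
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  k1 *= 0x1b873593u;
  return k1;
}

// Murmur3 x86_32 running-hash mixing.
inline uint32_t mixH1(uint32_t h1, uint32_t k1) {
  h1 ^= k1;
  h1 = rotl32(h1, 13);
  return h1 * 5u + 0xe6546b64u;
}

// Murmur3 x86_32 finalization ("avalanche") for an input of `length` bytes.
inline uint32_t fmix(uint32_t h1, uint32_t length) {
  h1 ^= length;
  h1 ^= h1 >> 16;
  h1 *= 0x85ebca6bu;
  h1 ^= h1 >> 13;
  h1 *= 0xc2b2ae35u;
  h1 ^= h1 >> 16;
  return h1;
}

// Hypothetical helper: hashes a 64-bit value as two little-endian
// 32-bit blocks (low word first) with seed 0.
inline int32_t murmur3HashLong(int64_t value) {
  const auto bits = static_cast<uint64_t>(value);
  uint32_t h1 = 0;
  h1 = mixH1(h1, mixK1(static_cast<uint32_t>(bits)));
  h1 = mixH1(h1, mixK1(static_cast<uint32_t>(bits >> 32)));
  return static_cast<int32_t>(fmix(h1, 8));
}

// Hypothetical helper: masking off the sign bit before the modulo keeps
// the result in [0, numBuckets).
inline int32_t icebergBucket(int32_t numBuckets, int64_t value) {
  assert(numBuckets > 0);
  return (murmur3HashLong(value) & INT32_MAX) % numBuckets;
}

// For contrast: a plain signed modulo keeps the sign of the hash in C++,
// so a negative hash yields a negative "bucket".
inline int32_t plainModulo(int32_t hash, int32_t numBuckets) {
  return hash % numBuckets;
}
```

With these definitions, murmur3HashLong(34) evaluates to 2017239379, the value the Iceberg specification lists for hashing the int 34, and plainModulo(-7, 4) evaluates to -3, which is the kind of negative result the sign-bit mask avoids.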

mbasmanova · 2025-07-08T12:25:33Z

@yingsu00 Ying, thank you for sharing detailed context. This is very helpful.

> If we want to keep it aligned with Hive, we may want to put the murmur3 hash function and bucket function in IcebergPartitionFunction (HivePartitionFunction keeps the hash algorithm implementations for Hive). Then when we are registering the iceberg bucket(c,n) function, it can call the hash algorithm implementations from IcebergPartitionFunction.

I like this option. It would be helpful to hide the details of the hash computation inside these partition functions and avoid creating classes with duplicate widely-known names (Murmur3) and slightly different implementations.

> I also wonder how we should set up the framework for connector specific functions. E.g. consider catalog validation (ensures these functions run on the correct catalogs, i.e. a user should not be able to run an Iceberg function on Hive tables). Also shall it be in a different "iceberg" namespace?

We do not have namespaces in Velox. It might be helpful to add these at some point. In the meantime, we recommend using prefixes.

> a user should not be able to run an Iceberg function on Hive tables

Why not?

jinchengchenghh · 2025-07-09T04:23:21Z

Regarding data exchange in queries: Gluten calls this function the same way it calls other Hive functions, and it does not use the Velox Exchange.
Exchange hashpartitioning(staticinvoke(class org.apache.iceberg.spark.functions.BucketFunction$BucketInt, IntegerType, invoke, 3, a#6, IntegerType, IntegerType, false, true, true), 1), REBALANCE_PARTITIONS_BY_COL, [plan_id=758]

yingsu00 · 2025-07-12T13:54:09Z

> a user should not be able to run an Iceberg function on Hive tables
>
> Why not?

@mbasmanova Thanks for the information. In my understanding, different table format specifications (Hive, Iceberg, etc.) support different functions. They differ in the following aspects:

1. The set of functions:

- Hive provides many built-in functions: https://hive.apache.org/docs/latest/languagemanual-udf_27362046/.
- Iceberg only defines partition-transform functions: https://iceberg.apache.org/spec/?utm_source=chatgpt.com#partition-transforms.

2. Functions with the same name may have totally different meanings:

- Iceberg day(ds) extracts a date or timestamp day as the number of days from 1970-01-01: day("1970-02-01") = 31.
- Hive day(ds) is the day part of a date or a timestamp string: day("1970-02-01 00:00:00") = 1, day("1970-02-01") = 1. The Presto day() function supported by Velox uses this definition.

So my understanding is:

- If a user is reading from a Hive table using Prestissimo, then he/she should be able to run Prestissimo built-in functions and Hive functions that are already supported by Velox.
- If a user is reading from an Iceberg table using Prestissimo, then he/she should be able to run Prestissimo built-in functions and Iceberg functions like day(ds). However, in this case the Iceberg day(ds) should be distinguished from the Prestissimo built-in day() function.

To avoid collisions, it would be ideal for Velox to adopt function namespaces so that, for example, hive.day() and iceberg.day() can coexist unambiguously. Hope this can be supported in the future.
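To make the two day() semantics above concrete, here is a small sketch. The function names icebergDay and dayOfMonth are hypothetical, chosen only to contrast the two definitions; neither is a Velox or Iceberg API. The day count uses the well-known civil-date-to-days arithmetic for the proleptic Gregorian calendar.

```cpp
#include <cstdint>

// Hypothetical helper: number of days from 1970-01-01 to the given
// proleptic Gregorian date (month in 1..12, day in 1..31). This is the
// quantity the Iceberg "day" transform is described as producing.
constexpr int64_t icebergDay(int64_t year, unsigned month, unsigned day) {
  // Treat January and February as the last months of the previous year so
  // that the leap day falls at the end of the shifted year.
  year -= (month <= 2) ? 1 : 0;
  const int64_t era = (year >= 0 ? year : year - 399) / 400;
  const unsigned yearOfEra = static_cast<unsigned>(year - era * 400);
  const unsigned dayOfYear =
      (153 * (month > 2 ? month - 3 : month + 9) + 2) / 5 + day - 1;
  const unsigned dayOfEra =
      yearOfEra * 365 + yearOfEra / 4 - yearOfEra / 100 + dayOfYear;
  return era * 146097 + static_cast<int64_t>(dayOfEra) - 719468;
}

// Hypothetical helper: the "day of month" meaning of day(), which simply
// returns the day field of the date.
constexpr unsigned dayOfMonth(int64_t /*year*/, unsigned /*month*/, unsigned day) {
  return day;
}
```

For 1970-02-01, icebergDay returns 31 (January has 31 days) while dayOfMonth returns 1, so the same function name applied to the same input yields different results depending on which definition is registered.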

jinchengchenghh · 2025-08-11T17:58:43Z

Could you help review again? Thanks! @mbasmanova @rui-mo

rui-mo

Thanks. Added some nits.

velox/docs/functions/iceberg/functions.rst

velox/functions/iceberg/BucketFunction.cpp

jinchengchenghh · 2025-08-20T16:08:39Z

Do you have further comments? @rui-mo

mbasmanova · 2025-08-27T16:31:14Z

velox/functions/iceberg/tests/CMakeLists.txt

+  velox_exec_test_lib
+  velox_expression
+  velox_memory
+  velox_dwio_common_test_utils


Curious, why is this dependency needed?

Removed, thanks!

jinchengchenghh · 2025-08-29T17:02:15Z

Do you have further comments? Thanks! @mbasmanova

facebook-github-bot · 2025-09-03T20:56:32Z

@Yuhta has imported this pull request. If you are a Meta employee, you can view this in D81621760.

jinchengchenghh · 2025-09-05T09:50:46Z

Could you help merge this one? Thanks! @Yuhta @mbasmanova

mbasmanova · 2025-09-05T11:36:43Z

@jinchengchenghh I see that Jimmy is working on merging this PR. Should land soon.

facebook-github-bot · 2025-09-05T16:52:27Z

@Yuhta merged this pull request in 3f7a211.

Summary:
The implementation aligns with https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java

And described in Iceberg document https://iceberg.apache.org/spec/#partition-transforms

Resolves: facebookincubator#13980

Pull Request resolved: facebookincubator#13174

Reviewed By: kKPulla

Differential Revision: D81621760

Pulled By: Yuhta

fbshipit-source-id: a8051cbb2676a8db0fef95e41c5858004941b7ce

zhli1142015 · 2025-09-23T11:50:50Z

velox/functions/iceberg/BucketFunction.cpp

+
+  template <typename T>
+  FOLLY_ALWAYS_INLINE Status call(int32_t& out, int32_t numBuckets, T input) {
+    VELOX_USER_RETURN_LE(numBuckets, 0, "Invalid number of buckets.");


nit: return Status::UserError("Invalid number of buckets");

This uses the new VELOX_USER_RETURN_LE API, which returns a UserError internally.
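The pattern discussed in this thread, a macro that returns an error status instead of throwing, might look roughly like the sketch below. Everything here is a hypothetical stand-in: SketchStatus is not the Velox Status class, SKETCH_USER_RETURN_LE is not the real VELOX_USER_RETURN_LE definition, and bucketSketch is not the function from this PR. The sketch assumes the semantics suggested by the snippet above, namely that the macro returns a user error when the left operand is less than or equal to the right one.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical minimal status type, used only for this illustration.
class SketchStatus {
 public:
  static SketchStatus OK() {
    return SketchStatus(true, "");
  }

  static SketchStatus UserError(std::string message) {
    return SketchStatus(false, std::move(message));
  }

  bool ok() const {
    return ok_;
  }

  const std::string& message() const {
    return message_;
  }

 private:
  SketchStatus(bool ok, std::string message)
      : ok_(ok), message_(std::move(message)) {}

  bool ok_;
  std::string message_;
};

// Hypothetical macro: returns a user error from the enclosing function
// when lhs <= rhs; otherwise execution falls through.
#define SKETCH_USER_RETURN_LE(lhs, rhs, message)   \
  do {                                             \
    if ((lhs) <= (rhs)) {                          \
      return SketchStatus::UserError(message);     \
    }                                              \
  } while (false)

// Hypothetical function showing the macro at a call site. `out` is left
// untouched when the argument check fails.
inline SketchStatus bucketSketch(int32_t& out, int32_t numBuckets, int32_t hash) {
  SKETCH_USER_RETURN_LE(numBuckets, 0, "Invalid number of buckets.");
  out = (hash & INT32_MAX) % numBuckets;
  return SketchStatus::OK();
}
```

Because the error is built inside the macro, the call site does not need a separate return Status::UserError(...) statement, which is the point made in the reply above.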

jinchengchenghh requested review from assignUser and majetideepak as code owners April 28, 2025 12:47

facebook-github-bot added the CLA Signed label Apr 28, 2025

jinchengchenghh commented Apr 28, 2025

velox/functions/iceberg/util/CMakeLists.txt (outdated, resolved)

rui-mo reviewed Apr 30, 2025

velox/functions/iceberg/BucketFunction.cpp (resolved)

velox/docs/ext/iceberg.py (resolved)

assignUser reviewed May 7, 2025

velox/functions/iceberg/util/CMakeLists.txt (outdated, resolved)

CMakeLists.txt (resolved)

jinchengchenghh mentioned this pull request Jul 1, 2025

feat(iceberg): Support Iceberg partition transforms #13874

Open

rui-mo mentioned this pull request Jul 1, 2025

docs: Extend documentation tooling to support linking to Iceberg functions #13207

Closed

mbasmanova changed the title ~~feat: Support iceberg bucket function~~ feat(iceberg): Add bucket function Jul 2, 2025

mbasmanova reviewed Jul 2, 2025

jinchengchenghh force-pushed the iceberg_bucket branch from ea854c8 to bcd0630 Compare July 3, 2025 09:20

mbasmanova reviewed Jul 3, 2025

jinchengchenghh marked this pull request as draft July 3, 2025 13:23

rui-mo reviewed Jul 3, 2025

yingsu00 reviewed Jul 8, 2025

jinchengchenghh force-pushed the iceberg_bucket branch 2 times, most recently from 312015f to 324bee7 Compare August 11, 2025 15:48

jinchengchenghh marked this pull request as ready for review August 11, 2025 15:51

rui-mo reviewed Aug 13, 2025

mbasmanova reviewed Aug 27, 2025

jinchengchenghh force-pushed the iceberg_bucket branch from 9b2c5f9 to e1fdd29 Compare August 28, 2025 10:52

jinchengchenghh added 7 commits August 29, 2025 09:38

support iceberg bucket function

cffeb38

fix code style

44a03c1

address comments

94b807c

address comments

e148717

Use new API

64fe5b8

Use simple evaluateOnce API to test

23cd757

Remove duplicate library

dfcd4a5

jinchengchenghh force-pushed the iceberg_bucket branch from e1fdd29 to dfcd4a5 Compare August 29, 2025 08:38

jinchengchenghh added 2 commits August 29, 2025 10:10

fix code style

d655549

address comments

257c8e4

jinchengchenghh added 2 commits September 1, 2025 10:08

address comments

f1ce22f

remove unused using namespace

8d59d01

mbasmanova approved these changes Sep 2, 2025

mbasmanova added the ready-to-merge label Sep 2, 2025

facebook-github-bot closed this in 3f7a211 Sep 5, 2025

facebook-github-bot added the Merged label Sep 5, 2025

jinchengchenghh mentioned this pull request Sep 23, 2025

feat: Add Spark div function #14935

Closed

zhli1142015 reviewed Sep 23, 2025

Copilot AI mentioned this pull request Nov 6, 2025

Align Iceberg bucket function parameter order with Presto/SQL convention Joe-Abraham/velox#28

Closed

This was referenced Nov 6, 2025

Register Iceberg bucket scalar function in system schema #15422

Closed

[native] Register Iceberg bucket scalar function in system schema prestodb/presto#26558

Open
